Benchmarking of analysis strategies for data-independent acquisition proteomics using a large-scale dataset comprising inter-patient heterogeneity

Numerous software tools exist for data-independent acquisition (DIA) analysis of clinical samples, necessitating their comprehensive benchmarking. We present a benchmark dataset comprising real-world inter-patient heterogeneity, which we use for in-depth benchmarking of DIA data analysis workflows for clinical settings. Combining spectral libraries, DIA software, sparsity reduction, normalization, and statistical tests results in 1428 distinct data analysis workflows, which we evaluate based on their ability to correctly identify differentially abundant proteins. From our dataset, we derive bootstrap datasets of varying sample sizes and use the whole range of bootstrap datasets to robustly evaluate each workflow. We find that all DIA software suites benefit from using a gas-phase fractionated spectral library, irrespective of the library refinement used. Gas-phase fractionation-based libraries perform best against two out of three reference protein lists. Among all investigated statistical tests, non-parametric permutation-based tests consistently perform best.
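The count of 1428 workflows can be checked with a minimal sketch. It assumes that the 17 DIA analysis workflows mentioned in the figure captions are the spectral library and DIA software combinations, and that each is crossed with the 3 sparsity reductions, 4 normalizations, and 7 statistical tests named there. All option labels below are hypothetical placeholders, since only the counts are given here.

```python
from itertools import product

# Placeholder labels (assumption): only the number of options is known,
# not their names.
dia_workflows = [f"dia_workflow_{i}" for i in range(1, 18)]    # 17 library/software combinations
sparsity_reductions = [f"sparsity_{i}" for i in range(1, 4)]   # 3 options
normalizations = [f"normalization_{i}" for i in range(1, 5)]   # 4 options
statistical_tests = [f"test_{i}" for i in range(1, 8)]         # 7 tests

# Every data analysis workflow is one combination of the four choices.
workflows = list(product(dia_workflows, sparsity_reductions,
                         normalizations, statistical_tests))
n_workflows = len(workflows)
print(n_workflows)  # 1428 = 17 * 3 * 4 * 7
```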

NIPALS (Nonlinear Iterative Partial Least Squares) principal component analysis (PCA) was performed on the protein abundances of the DIA analysis workflows after quantile normalization. The chronological order of measurement is ascending, indicated by increasingly dark label colors. Sample 28, which belongs to spike-in condition 1:6, is not included in this plot because it is an outlier with a high degree of missing values. NIPALS was used because it can be applied directly to data with missing values. Source data are provided as a Source Data file.

Pearson correlation between log2 protein intensities of all DIA analysis workflows (using all complete pairs of observations). The calculated correlation is based on the 3966 proteins common to all DIA analysis workflows. Pearson correlation ranges from -1 (perfect negative correlation, red) to 1 (perfect positive correlation, blue). The shape of the ellipses indicates the same: a rounder shape indicates a lower correlation, a narrower shape a higher (positive) correlation. Source data are provided as a Source Data file.

Figure 5: Human proteins are identified and quantified equally across all spike-in conditions.
Log2 protein abundance distribution of human proteins separated by the four spike-in conditions. The overall median is indicated by the red line. The average number of identified human proteins per sample within each DIA analysis workflow is displayed above each violin plot. For spike-in condition 1:6, data of n=22 biologically independent samples were used; for each of the other spike-in conditions, data of n=23 biologically independent samples were used. The boxplots show median (center line), interquartile range (IQR, extending from the first to the third quartile) (box), and 1.5 * IQR (whiskers). Source data are provided as a Source Data file.

... to the other DIA software suites, for example, the accumulation at 25% missing values, which originates from the human-only samples.

FDR filtering applied to the OpenSwath protein data of the 92 lymph node samples using TRIC with an FDR of 1% (left), TRIC with an FDR of 5% (center), and PyProphet with an FDR of 1% (right). PyProphet is used in our study. Human and E. coli proteins are depicted in blue and red, respectively. Source data are provided as a Source Data file.

Comparison of normalization options for no sparsity reduction (NoSR). pAUC was calculated based on the 'DIA workflow' reference protein list. The red line indicates the overall median. Each subplot is based on n=2100 bootstrap datasets, which were generated by drawing with replacement from data of n=23 biologically independent samples of spike-in conditions 1:12 and 1:25, respectively. The sample size of these bootstrap datasets ranged from 3 to 23 samples, which due to drawing with replacement can appear multiple times. For each subplot, that comes to a total of n=2100*3 sparsity reductions*7 statistical tests=44100 data points per normalization setting.
The boxplots show median (center line), interquartile range (IQR, extending from the first to the third quartile) (box), and 1.5 * IQR (whiskers). Source data are provided as a Source Data file.

Comparison of statistical tests for no sparsity reduction (NoSR). pAUC was calculated based on the DIAworkflow reference protein list. The red line indicates the overall median. All seven statistical tests were two-sided and not adjusted for multiple testing. Each subplot is based on n=2100 bootstrap datasets, which were generated by drawing with replacement from data of n=23 biologically independent samples of spike-in conditions 1:12 and 1:25, respectively. The sample size of these bootstrap datasets ranged from 3 to 23 samples, which due to drawing with replacement can appear multiple times. For each subplot, that comes to a total of n=2100*3 sparsity reductions*4 normalizations=25200 data points per statistical test setting. The boxplots show median (center line), interquartile range (IQR, extending from the first to the third quartile) (box), and 1.5 * IQR (whiskers). Source data are provided as a Source Data file.
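The bootstrap scheme described in these captions can be illustrated with a minimal sketch. It assumes that each of the 21 sample sizes (3 to 23) is drawn 100 times, which is inferred from 2100/21 and not stated explicitly, and that the sample size applies to each of the two spike-in conditions. The function name, sample identifiers, and seed are hypothetical.

```python
import random

def make_bootstrap_datasets(samples_a, samples_b, sizes=range(3, 24),
                            repeats=100, seed=0):
    """Draw sample identifiers with replacement for two conditions.

    Assumption: `repeats` draws per sample size and the same size for
    both conditions; neither detail is confirmed by the captions.
    """
    rng = random.Random(seed)
    datasets = []
    for size in sizes:
        for _ in range(repeats):
            datasets.append({
                "size": size,
                # random.choices samples with replacement, so a sample
                # can appear multiple times within one bootstrap dataset.
                "group_a": rng.choices(samples_a, k=size),
                "group_b": rng.choices(samples_b, k=size),
            })
    return datasets

# Hypothetical identifiers for the n=23 samples of each condition.
cond_1_12 = [f"s12_{i}" for i in range(1, 24)]
cond_1_25 = [f"s25_{i}" for i in range(1, 24)]

bootstraps = make_bootstrap_datasets(cond_1_12, cond_1_25)
n_bootstraps = len(bootstraps)
points_per_normalization = n_bootstraps * 3 * 7   # 3 sparsity reductions, 7 tests
points_per_test = n_bootstraps * 3 * 4            # 3 sparsity reductions, 4 normalizations

print(n_bootstraps)              # 2100
print(points_per_normalization)  # 44100
print(points_per_test)           # 25200
```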
Supplementary Figure 15: OpenSwath shows the highest variance in protein intensities (together with Skyline) and sample intensities, as well as the highest percentage of the contribution of the first and second principal component to the total variance.
Distribution of selected data characteristics for all bootstrap datasets of the different DIA analysis workflows. Terminology of the data characteristics (the extensions .wNAs and .woNAs indicate whether proteins containing missing values were included or excluded): medianSampleVariance = median of the variances of the samples, medianProteinVariance = median of the variances of the proteins, percNATotal = percentage of missing values, kurtosis.wNAs = kurtosis, skewness.wNAs = skewness, var.groups.ratio.wNAs = median of the ratio of the protein variances of two groups (here: the two spike-in conditions 1:25 and 1:12), prcPC1.woNAs = percentage of the contribution of the first principal component to the total variance, prcPC2.woNAs = percentage of the contribution of the second principal component to the total variance. Each of the eight subplots is based on n=2100 bootstrap datasets, which were generated by drawing with replacement from data of n=23 biologically independent samples of spike-in conditions 1:12 and 1:25, respectively. The sample size of these bootstrap datasets ranged from 3 to 23 samples, which due to drawing with replacement can appear multiple times. The boxplots show median (center line), interquartile range (IQR, extending from the first to the third quartile) (box), and 1.5 * IQR (whiskers). Source data are provided as a Source Data file.

The analysis is based on n=2100 bootstrap datasets, which were generated by drawing with replacement from data of n=23 biologically independent samples of spike-in conditions 1:12 and 1:25, respectively. The sample size of these bootstrap datasets ranged from 3 to 23 samples, which due to drawing with replacement can appear multiple times. That results in a total of n=2100*3 sparsity reductions*4 normalizations*7 statistical tests=176400 cases, for which both prediction performance and information on data characteristics are available. Source data are provided as a Source Data file.
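Three of the data characteristics defined above can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the matrix is assumed to hold proteins in rows and samples in columns, missing values are represented as NaN and skipped when computing variances, and sample variances are used. The function names and the toy matrix are hypothetical.

```python
import math
from statistics import median, variance

def observed(values):
    """Return the non-missing entries of a row or column."""
    return [v for v in values if not math.isnan(v)]

def data_characteristics(matrix):
    """Compute percNATotal, medianProteinVariance and medianSampleVariance.

    Assumptions: rows are proteins, columns are samples, NaN marks a
    missing value, and rows/columns with fewer than two observed values
    are left out of the variance medians.
    """
    n_values = sum(len(row) for row in matrix)
    n_missing = sum(math.isnan(v) for row in matrix for v in row)
    columns = list(zip(*matrix))
    protein_vars = [variance(observed(r)) for r in matrix if len(observed(r)) >= 2]
    sample_vars = [variance(observed(c)) for c in columns if len(observed(c)) >= 2]
    return {
        "percNATotal": 100.0 * n_missing / n_values,
        "medianProteinVariance": median(protein_vars),
        "medianSampleVariance": median(sample_vars),
    }

nan = float("nan")
# Toy log2 intensities: 4 proteins x 3 samples, values invented for illustration.
toy_matrix = [
    [1.0, 2.0, 3.0],
    [2.0, nan, 4.0],
    [5.0, 5.0, 5.0],
    [nan, 1.0, 3.0],
]
characteristics = data_characteristics(toy_matrix)
print(characteristics)
```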
Supplementary Figure 17: The correlation between the data characteristics varies with DIA analysis workflow.
Pearson correlations between data characteristics and performance measures of bootstrap datasets (using all complete pairs of observations) for all 17 DIA analysis workflows. Pearson correlation ranges from -1 (perfect negative correlation, red) to 1 (perfect positive correlation, blue). The terminology of the data characteristics and performance measures is analogous to Supplementary Figure 16. The analysis is based on n=2100 bootstrap datasets, which were generated by drawing with replacement from data of n=23 biologically independent samples of spike-in conditions 1:12 and 1:25, respectively. The sample size of these bootstrap datasets ranged from 3 to 23 samples, which due to drawing with replacement can appear multiple times. That comes to a total of n=2100*3 sparsity reductions*4 normalizations*7 statistical tests=176400 cases, for which both prediction performance and information on data characteristics are available. Source data are provided as a Source Data file.

Prediction performance of different statistical tests in pAUC over different sample sizes for protein reference lists a) Intersection, b) DiaWorkflow, and c) Combined. The lines, which are color-coded by statistical test, represent the median over the respective sample size. All three reference protein lists are similar in terms of the order of best performing statistical tests. The red line indicates the overall median. All seven statistical tests were two-sided and not adjusted for multiple testing.
Each subplot is based on n=2100 bootstrap datasets, which were generated by drawing with replacement from data of n=23 biologically independent samples of spike-in conditions 1:12 and 1:25, respectively. The sample size of these bootstrap datasets ranged from 3 to 23 samples, which due to drawing with replacement can appear multiple times. Each data point underlying the lines represents the median of n=(2100/21)*3 sparsity reductions*4 normalizations=1200 data points. Source data are provided as a Source Data file.
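How each point of such a line could be obtained is sketched below: pAUC values are grouped by statistical test and sample size, and the median of each group is taken. The record layout, the function name, and the toy pAUC values are hypothetical and serve only to illustrate the aggregation; they are not results from the study.

```python
from collections import defaultdict
from statistics import median

def median_pauc_lines(records):
    """Group (test, sample_size, pauc) records and take the median per group.

    Returns {test: {sample_size: median pAUC}}, with sample sizes sorted
    in ascending order so that each inner dict traces one line.
    """
    grouped = defaultdict(list)
    for test, size, pauc in records:
        grouped[(test, size)].append(pauc)
    lines = defaultdict(dict)
    for (test, size), values in grouped.items():
        lines[test][size] = median(values)
    return {test: dict(sorted(points.items())) for test, points in lines.items()}

# Toy records with invented pAUC values (illustration only).
toy_records = [
    ("test_A", 3, 0.2), ("test_A", 3, 0.4), ("test_A", 3, 0.3),
    ("test_A", 4, 0.5),
    ("test_B", 3, 0.1), ("test_B", 3, 0.3),
]
lines = median_pauc_lines(toy_records)

# In the caption, each median summarizes 100 bootstrap datasets per sample
# size times 3 sparsity reductions times 4 normalizations.
points_per_median = (2100 // 21) * 3 * 4
print(points_per_median)  # 1200
```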